A Portable Algorithm for Mapping Bitext Correspondence
Abstract
The first step in most empirical work in multilingual NLP is to construct maps of the correspondence between texts and their translations (bitext maps). The Smooth Injective Map Recognizer (SIMR) algorithm presented here is a generic pattern recognition algorithm that is particularly well-suited to mapping bitext correspondence. SIMR is faster and significantly more accurate than other algorithms in the literature. The algorithm is robust enough to use on noisy texts, such as those resulting from OCR input, and on translations that are not very literal. SIMR encapsulates its language-specific heuristics, so that it can be ported to any language pair with minimal effort.
Similar resources
Models of Co-occurrence
A model of co-occurrence in bitext is a boolean predicate that indicates whether a given pair of word tokens co-occur in corresponding regions of the bitext space. Co-occurrence is a precondition for the possibility that two tokens might be mutual translations. Models of co-occurrence are the glue that binds methods for mapping bitext correspondence with methods for estimating translation models ...
Automatic Detection of Omissions in Translations
ADOMIT is an algorithm for Automatic Detection of OMIssions in Translations. The algorithm relies solely on geometric analysis of bitext maps and uses no linguistic information. This property allows it to deal equally well with omissions that do not correspond to linguistic units, such as might result from word-processing mishaps. ADOMIT has proven itself by discovering many errors in a hand-co...
An Automatic Filter for Non-Parallel Texts
Numerous cross-lingual applications, including state-of-the-art machine translation systems, require parallel texts aligned at the sentence level. However, collections of such texts are often polluted by pairs of texts that are comparable but not parallel. Bitext maps can help to discriminate between parallel and comparable texts. Bitext mapping algorithms use a larger set of document features ...
Parsing Word-Aligned Parallel Corpora in a Grammar Induction Context
We present an Earley-style dynamic programming algorithm for parsing sentence pairs from a parallel corpus simultaneously, building up two phrase structure trees and a correspondence mapping between the nodes. The intended use of the algorithm is in bootstrapping grammars for less studied languages by using implicit grammatical information in parallel corpora. Therefore, we presuppose a given (...
Publication date: 1997